Analyzing neural responses to natural signals: 
Maximally informative dimensions 



Tatyana Sharpee/ Nicole C. Rust,^ and William Bialek^'^ 

^ Sloan-Swartz Center for Theoretical Neurobiology and Department of 
Physiology, University of California at San Francisco, San Francisco, CA 94143 
^ Center for Neural Science, New York University, New York, NY 10003 
^ Department of Physics, Princeton University, Princeton, New Jersey 08544 
sharpee@phy.ucsf.edu, rust@cns.nyu.edu, wbialek@princeton.edu 



February 2, 2008 



We propose a method that allows for a rigorous statistical analysis of neu- 
ral responses to natural stimuli which are non-Gaussian and exhibit strong 
correlations. We have in mind a model in which neurons are selective for 
a small number of stimulus dimensions out of a high dimensional stimulus 
space, but within this subspace the responses can be arbitrarily nonlinear. Ex- 
isting analysis methods are based on correlation functions between stimuli 
and responses, but these methods are guaranteed to work only in the case of 
Gaussian stimulus ensembles. As an alternative to correlation functions, we 
maximize the mutual information between the neural responses and projec- 
tions of the stimulus onto low dimensional subspaces. The procedure can be 
done iteratively by increasing the dimensionality of this subspace. Those di- 
mensions that allow the recovery of all of the information between spikes and 
the full unprojected stimuli describe the relevant subspace. If the dimension- 
ality of the relevant subspace indeed is small, it becomes feasible to map the 
neuron's input-output function even under fully natural stimulus conditions. 
These ideas are illustrated in simulations on model visual and auditory neu- 
rons responding to natural scenes and sounds, respectively. 

1 Introduction 



From olfaction to vision and audition, a growing number of experiments are examin- 
ing the responses of senso ry neurons to natural stimuli ^ reutzfeldt & Nor thdurft, 1978; 
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range of neural responses may require using stimulus ensembles which approximate 
those occurring in nature (,Rieke et al.- .1997: .Simoncelli & Olshausert .2001,) . and it is an 
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attractive hypothesis t hat the neura l representation of these natural signals may be op 
timized in some way (lBarlowLE96ll. 1200 Ic Ivon der Twer & MacleodL I2OO1I: iBialeld. l2002|) 



Many neurons exhibit strongly nonlinear and adaptive responses that are unlikely to be 
predicted from a combination of responses to simple stimuli; for example neurons have 
been shown to adapt to the distribution of sensory inputs, so tha t any characterization 
of these responses will depend on context (jSmirnakis et al. , 19961: iBrenner et al. , 2000aL 



Fairhall et al. I. l200l|) . Finally, the variability of a neuron's respons es decreases substan 



tially when complex dynamical, rathe r than static, st imuli are used ainen & Seinowskijj 



1995 : lde Ruyter van Steveninck et aD.[l997: Kara et al.. 2000: de Ruyter van Steveninck et al 
200 1|) . All of these arguments point to the need for general tools to analyze the neural re- 
sponses to complex, naturalistic inputs. 

The stimuli analyzed by sensory neurons are intrinsically high dimensional, with di- 
mensions D ~ 10^ — 10^. For example, in the case of visual neurons, the input is commonly 
specified as light intensity on a grid of at least 10 x 10 pixels. Each of the presented stim- 
uli can be described as a vector s in this high dimensional stimulus space, see Fig. [ij The 
dimensionality becomes even larger if stimulus history has to be considered as well. For 
example, if we are interested in how the past frames of the movie affect the proba- 
bility of a spike, then the stimulus s, being a concatenation of the past N samples, will 
have dimensionality times that of a single frame. We also assume that the probability 
distribution P(s) is sampled during an experiment ergodically, so that we can exchange 
averages over time with averages over the true distribution as needed. 

Even though direct exploration of a -D ~ 10^ — 10'^ dimensional stimulus space is 
beyond the constraints of experimental data collection, progress can be made provided 
we make certain assumptions about how the response has been generated. In the simplest 
mode l, the p robability of response can be described by one receptive field (RF) or linear 



filter ( Rieke et al.L Il997i) . The receptive field can be thought of as a template or special 



direction ei in the stimulus space^ such that the neuron's response depends only on a 
projection of a given stimulus s onto ei, although the dependence of the response on 
this projection can be strongly nonlinear, cf. Fig. [U In this simple model, the reverse 
correlation method ( de Boer & Kuyper, 1968; Rieke et al.l ll997l: (chichilnisky. 2001) can be 
used to recover the vector ci by analyzing the neuron's responses to Gaussian white noise. 
In a more general case, the probability of the response depends on projections = Cj ■ s 
of the stimulus s on a set of K vectors {ei, 62, ... , ex}: 

P(spike|s) = P(spike)/(si, S2, sk), (1) 

where P (spike |s) is the probability of a spike given a stimulus s and P (spike) is the aver- 
age firing rate. In what follows we will call the subspace spanned by the set of vectors 
{ei, 62, ••• , ex} the relevant subspace (RS)^. We reiterate that vectors {cj}, 1 < i < K may 

^The notation e denotes a unit vector, since we are interested only in the direction the 
vector specifies and not in its length. 

^Since the analysis does not depend on a particular choice of a basis within the full 
D-dimensional stimulus space, for clarity we choose the basis in which the first K basis 
vectors span the relevant subspace and the remaining D — K vectors span the irrelevant 
subspace. 
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also describe how the time dependence of the stimulus s affects the probability of a spike. 
An example of such a relevant dimension would be a spatiotemporal receptive field of 
a visual neuron. Even though the ideas developed below can be used to analyze input- 
output f unctions f wit h respect to different neural res ponses, such as pattern s of sp ikes 
in tim e ( de Ruyter van Steveninck & Bialek, 1988; Bren ner et al. , 2000bl:lReina gel & Reidi 



for illustration purposes we choose a single spike as the response of interest. 



Equation ^ in itself is not yet a simplification if the dimensionality K of the RS 
is equal to the dimensionality D of the stimulus space. In this paper we will assume 
that the neuron's firing is sensitive only to a small number of stimulus features, i.e. 
K <^ D. While the general idea of searching for low dimensional structure in high 
dimensional data is very old, our motivation here comes from work on the fly visual 
system where it was shown explicitly that patterns of action potentials in identified mo- 
tion sensitive neurons a re correlated with low dimensional projecti ons of the h igh di- 
mensional visual i nput ide Ruvter van Steveninck & Bialek. 1988; Brenner et al.i, 2000ai; 
feialek & de Ruvter van Steveninckl.l2nn3^ . The input-output function / in Eq. Q can be 
strongly nonlinear, but it is presumed to depend only on a small number of projections. 
This assumption appears to be less stringent than that of approximate linearity which one 
makes when characterizing neu ron's resp onse in terms of Wiener kernels [see for example 



the discussion in Section 2.1.3 of iRieke et al., (,1997 )]. The most difficult part in reconstruct- 
ing the input-output function is to find the RS. Note that for > 1, a description in terms 
of any linear combination of vectors {ei, 62, ... , ck} is just as valid, since we did not make 
any assumptions as to a particular form of the nonlinear function /. 

Once the relevant subspace is known, the probability P(spike|s) becomes a function 
of only a few parameters, and it becomes feasible to map this function experimentally, 
inverting the probability distributions according to Bayes' rule: 

rf X P(si,S2, ...,SK|spike) 

f{Si,S2,...,SK) = ^7 ^ — . (2) 

P{Si,S2,...,Sk) 

If stimuli are chosen from a correlated Gaussian noise ensembl e, then the neural response 

can be characterized by the spike-triggered covariance method (|de Ruyter van Steveninck & Dialed 

^We emphasize that our focus here on single spikes is not equivalent to assuming that 
the spike train is a Poisson process modulated by the stimulus. No matter what the sta- 
tistical structure of the spike train is we always can ask what features of the stimulus are 
relevant for setting the probability of generating a single spike at one moment in time. 
From an information theoretic point of view, asking for stimulus features that capture the 
mutual information between the stimulus and the arrival times of single spikes is a well 
posed question even if successive spikes do not carry independent information; note also 
that spikes carrying independent information is not the same as spikes being generated 
as a Poisson process. On the other hand, if (for example) different temporal patterns of 
spikes carry information about different stimulus features, then analysis of single spikes 
will result in a relevant subspace of artefactually high dimensionality. Thus it is impor- 
tant that the approach discussed here carries over without modification to the analysis of 
relevant dimensions for the generation of any discrete event, such as a pattern of spikes 
across time in one cell or synchronous spikes across a population of cells. For a related 
discussion of relevant dimensions and spike patterns using covariance matrix methods 
see (,de Ruyter van Steveninck & Bialek,.1988: Agtiera y Areas et al., 2003) . 
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Figure 1: Schematic illustration of a model with a one-dimensional relevant subspace: ei 
is the relevant dimension, and 62 and 63 are irrelevant ones. Shown below are three ex- 
ample stimuli, s, s', and s", the receptive field of a model neuron — the relevant dimension 
ei, and our guess v for the relevant dimension. Probabilities of a spike P (spike |s ■ ei) and 
P(spike|s ■ v) are calculated by first projecting all of the stimuli s onto each of the two 
vectors ei and v, respectively, and then applying Eqs. (|5I2I1|I sequentially. Our guess v for 
the relevant dimension is adjusted during the progress of the algorithm in such a way as 
to maximize /(v) of Eq. (|7|l, which makes vector v approach the true relevant dimension 
ei. 



1988':'Br enner et al.l.Efflnal:ISchwartz et al.l.l20n2l:fTouryan et al]. l2nn2':'Biale k & de Ruyter van Steveninck 
2003 ). It can be shown that the dimensionality of the RS is equal to the number of nonzero 
eigenvalues of a matrix given by a difference between covariance matrices of all presented 
stimuli and stimuli conditional on a spike. Moreover, the RS is spanned by the eigen- 
vectors associated with the nonzero eigenvalues multiplied by the inverse of the a priori 
covariance matrix. Compared to the reverse correlation method, we are no longer limited 
to finding only one of the relevant dimensions {ci}, I < i < K. Both the reverse correla- 
tion and the spike-triggered covariance method, however, give rigorously interpretable 
results only for Gaussian distributions of inputs. 

In this paper we investigate whether it is possible to lift the requirement for stimuli to 
be Gaussian. When using natural stimuli, which certainly are non-Gaussian, the RS can- 
not be found by the spike-triggered covariance method. Similarly, the reverse correlation 
method does not give the correct RF, even in the simplest case where the input-output 
function in Eq. (jT)) depends only on one projection; see Appendix |X| for a discussion of 



4 



this point. However, vectors that span the RS are clearly special directions in the stimu- 
lus space independent of assumptions about -P(s). This notion can be quantified by the 
Shannon information. We note that the stimuli s do not have to lie on a low-dimensional 
manifold within the overall D dimensional space. However, since we assume that the 
neuron's input-output function depends on a small number of relevant dimensions, the 
ensemble of stimuli conditional on a spike may exhibit clear clustering. This makes the 
proposed method of looking for the RS complimentary to the clustering of stimuli condi- 
tional on a s pike done in the information bottleneck method l|Tishbyetal.l. E999): see also 
([Dimitrov & Miller. 2001,) . Non-information based measures of similarity between prob- 
ability distributions P(s) and P(s| spike) have also been proposed to find the RS (,Paninski. 



2003a) 



To summarize our assumptions: 

• The sampling of the probability distribution of stimuli P(s) is ergodic and stationary 
across repetitions. The probability distribution is not assumed to be Gaussian. The 
ensemble of stimuli described by P(s) does not have to lie on a low-dimensional 
manifold embedded in the overall D-dimensional space. 

• We choose a single spike as the response of interest (for illustration purposes only). 
An identical scheme can be applied, for example, to particular interspike intervals 
or to synchronous spikes from a pair of neurons. 

• The subspace relevant for generating a spike is low dimensional and Euclidean, cf. 
Eq.lll 

• The input-output function, which is defined within the low dimensional RS, can 
be arbitrarily nonlinear. It is obtained experimentally by sampling the probability 
distributions P(s) and P(s|spike) within the RS. 

The paper is organized as follows. In Sec.|2lwe discuss how an optimization problem 
can be formulated to find the RS. A particular algorithm used to implement the optimiza- 
tion scheme is described in Sec. |3l In Sec. HI we illustrate how the optimization scheme 
works with natural stimuli for model orientation sensitive cells with one and two rele- 
vant dimensions, much like simple and complex cells found in primary visual cortex, as 
well as for a model auditory neuron responding to natural sounds. We also discuss the 
convergence of our estimates of the RS as a function of data set size. We emphasize that 
our optimization scheme does not rely on any specific statistical properties of the stimulus 
ensemble, and thus can be used with natural stimuli. 



*If one suspects that neurons are sensitive to low dimensional features of their input, 
one might be tempted to analyze neural responses to stimuli that explore only the (puta- 
tive) relevant subspa ce, perhaps along the line of the subspace reverse correlation method 
jRingach et al.l.E997|) . Our approach (like the spike-triggered covariance approach) is dif- 
ferent because it allows the analysis of responses to stimuli that live in the full space, and 
instead we let the neuron "tell us" which low dimensional subspace is relevant. 
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2 Information as an objective function 



When analyzing neural responses, we compare the a priori probability distribution of 
all prese nted st imuli with the probability distribution of stimuli which lead to a spike 
Ide Ruy ter van Steveninck & Bialek, 1988). For Gaussian signals, the probability distri- 
bution can be characterized by its second moment, the covariance matrix. However, an 
ensemble of natural stimuli is not Gaussian, so that in a general case neither second nor 
any finite number of moments is sufficient to describe the probability distribution. In this 
situation. Shannon information provides a rigorous way of comparing two probability 
dist ributions. The avera ge information carried by the arrival time of one spike is given 
by ( Brenner et al.l.l2nnnb|) 



/spike = J (isP(s|spike) logs 



P(s|spike) 
Pis) 



(3) 



where ds denotes integration over full D-dimensional stimulus space. The information 
per spike as written in is difficult to estimate experimentally, since it requires either 
sampling of the high-dimensional probability distribution P(s| spike) or a model of how 
spikes were generated, i.e. the knowledge of low-dimensional RS. However it is possible 
to calculate /spike in a model-independent way, if stimuli are presented multiple times to 
estimate the probability distribution P(spike|s). Then, 



/ P(spike|s) 
^^P^-^^ - \ P(spike) 



P(spike|s) 
P(spike) 



(4) 



where the average is taken over all presented stimuli. This can be useful in practice 
dBrenne r et aO-bonn b^. because we can replace the ensemble average ()s with a time aver- 
age, and P(spike|s) with the time dependent spike rate r{t). Note that for a finite dataset 
of repetitions, the obtained value /spike (^) will be on average larger than /spike (oo). 
The true value /spike can be found eithe r bv subtracting an expected bias value, which 
is of the order of ~ l/('Pfspike)iy 21n2) dTreves & Panzeril. E995L IPanzeri & Trevesl.ll996 : 
Pola et al.', '2002'; 'Paninski', '2003b"), or by extrapolating to ^ oo ferermer e t al.l. l2000b : 



Strong et al... J398}. Measurement of /gpikc in this way provides a model independent 
benchmark against which we can compare any description of the neuron's input-output 
relation. 

Our assumption is that spikes are generated according to a projection onto a low di- 
mensional subspace. Therefore to characterize relevance of a particular direction v in the 
stimulus space, we project all of the presented stimuli onto v and form probability dis- 
tributions Pvix) and Pv(2;|spike) of projection values x for the a priori stimulus ensemble 
and that conditional on a spike, respectively: 

Pv(x) = (5(x-s-v))„ (5) 
Pv(x|spike) = — s ■ v)|spike)s, (6) 

where 6{x) is a delta-function. In practice, both the average (■ ■ ■)s = J ds - ■ ■P(s) over 
the a priori stimulus ensemble, and the average (■ ■ • | spike) s = J ds - ■ ■ P(s| spike) over the 
ensemble conditional on a spike, are calculated by binning the range of projections values 
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X. The probability distributions are then obtained as histograms, normalized in a such a 
way that the sum over all bins gives 1. The mutual information between spike arrival 
times and the projection x, by analogy with Eq. is 



/(v) = / (ixPv(a;|spike) log2 



Pv (a; I spike) 



(7) 



which is also the KuUback-Leibler divergence D [Pv(x|spike)||Pv(x)]; notice that this in- 
formation is a function of the direction v. The information /(v) provides an invariant 
measure of how much the occurrence of a spike is determined by projection on the di- 
rection v. It is a function only of direction in the stimulus space and does not change 
when vector v is multiplied by a constant. This can be seen by noting that for any prob- 
ability distribution and any constant c, Pcv{x) = c~^P^{x/c) [see also Theorem 9.6.4 of 
Cover & ThomasI (Il99l|) 1. When evaluated along any vector v, /(v) < /spike- The total 
information /spike can be recovered along one particular direction only if v = ei, and only 
if the RS is one dimensional. 

By analogy with 0, one could also calculate information /(vi, v„) along a set of 
several directions {vi, v„} based on the multi-point probability distributions of projec- 
tion values Xi, X2, ... a;„ along vectors vi, V2, ... v„ of interest: 



If we are successful in finding all of the directions {cj}, 1 < i < K contributing to the 
input-output relation (|T]|, then the information evaluated in this subspace will be equal 
to the total information /spike- When we calculate information along a set of K vectors 
that are slightly off from the RS, the answer, of course, is smaller than /spike and is initially 
quadratic in small deviations 5vi. One can therefore hope to find the RS by maximizing 
information with respect to K vectors simultaneously. The information does not increase 
if more vectors outside the RS are included. For uncorrelated stimuli, any vector or a 
set of vectors that maximizes /(v) belongs to the RS. On the other hand, as discussed in 
Appendix |Bl the result of optimization with respect to a number of vectors k < K may 
deviate from the RS if stimuli are correlated. To find the RS, we first maximize /(v), and 
compare this maximum with /spike/ which is estimated according to (Hjl. If the difference 
exceeds that expected from finite sampling corrections, we increment the number of di- 
rections with respect to which information is simultaneously maximized. 



In this section we describe a particular algorithm we used to look for the most informative 
dimensions in order to find the relevant subspace. We make no claim that our choice of 
the algorithm is most efficient. However, it does give reproducible results for different 
starting points and spike trains with differences taken to simulate neural noise. Overall, 




(9) 



(8) 



3 Optimization algorithm 
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choices for an algorithm are broader because the information /(v) as defined by (|7|l is a 
continuous function, whose gradient can be computed. We find (see Appendix |Cl for a 
derivation) 



Vv/ 



j dxPv{x) [(s|x, spike) — (s|x)] 



d Pv (a; I spike) 
dx Pv(x) 



where 



(s|a;, spike) 



P(x|spike' 



J ds sS{x — s ■ v)P(s|spike), 



(10) 



(11) 



and similarly for (s|x). Since information does not change with the length of the vector, 
we have v ■ Vv/ = 0, as also can be seen directly from Eq. (ITOb . 

As an optimization algorithm, we have used a combination of gradient ascent and 
simulated annealing alg orithms: successive line maximizations were done along the di- 
rection of the gradient (|Press et al.. 1992) . During line maximizations, a point with a 
smaller value of information was accepted according to Boltzmann statistics, with proba- 
bility oc exp[(/(vj+i) — /(vj))/T]. The effective temperature T is reduced by factor of 1 — 
upon completion of each line maximization. Parameters of the simulated annealing algo- 
rithm to be adjusted are the starting temperature To and the cooling rate ey, AT = —txT. 
When maximizing with respect to one vector we used values To = 1 and ct = 0.05. 
When maximizing with respect to two vectors, we either used the cooling schedule with 
er = 0.005 and repeated it several times (4 times in our case) or allowed the effective tem- 
perature T to increase by a factor of 10 upon convergence to a local maximum (keeping 
T < Tq always), while limiting the total number of line maximizations. 

The problem of maximizing a function often is related to the problem of making a good 
initial guess. It turns out, however, that the choice of a starting point is much less crucial 
in cases where the stimuli are correlated. To illustrate this point we plot in Fig. |2l the 
probability distribution of information along random directions v both for white noise 
and for naturalistic stimuli in a model with one relevant dimension. For uncorrelated 
stimuli, not only is information equal to zero for a vector that is perpendicular to the 
relevant subspace, but in addition the derivative is equal to zero. Since a randomly chosen 
vector has on average a small projection on the relevant subspace (compared to its length) 



n/d, the corresponding information can be found by expanding in fr/|v| 



2|v| 



dxPp 




[(scrlspike) — (se^)]" 



(12) 



where vector v = VrCr + VirCir is decomposed in its components inside and outside the RS, 
respectively. The average information for a random vector is, therefore, ~ = 

K/D. 

In cases where stimuli are drawn from a Gaussian ensemble with correlations, an ex- 
pression for the information values has a similar structure to (|T2)l . To see this, we trans- 
form to Fourier space and normalize each Fourier component by the square root of the 
power spectrum ^(k). In this new basis, both the vectors {e/}, I < i < K, forming the RS 
and the randomly chosen vector v along which information is being evaluated are to be 
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Figure 2: The probability distribution of information values in units of the total informa- 
tion per spike in the case of (a) uncorrelated binary noise stimuli, (b) correlated Gaussian 
noise with power spectrum of natural scenes, and (c) stimuli derived from natural scenes 
(patches of photos). The distribution was obtained by calculating information along 10^ 
random vectors for a model cell with one relevant dimension. Note the different scales in 
the two panels. 



multiplied by y S(k). Thus, if we now substitute for the dot product v"^ the convolution 
weighted by the power spectrum, Y^f^i^ * Cj)^, where 



V * e,; 



Ek^(k)e,(k)5(k) 



Ek^^^(k)5(k)v'Ekef(k)5(k) 



(13) 



then Eq. (IT2t will describe information values along randomly chosen vectors v for corre- 
lated Gaussian stimuli with the power spectrum ^(k). Even though both and L'(k) are 
Gaussian variables with variance ~ 1 /-D, the weighted convolution has not only a much 
larger variance but is also strongly non-Gaussian [the non-Gaussian character is due to 
the normalizing factor Ek ^^^(k)S'(k) in the denominator of Eq. dlStl. As for the variance, it 
can be estimated as < (v*ei)^ >= Atc/ D,m cases where stimuli are taken as patches of 
correlated Gaussian noise with the two-dimensional power spectrum S'(k) = A/k"^. The 
large values of the weighted dot product y * Ci, 1 < i < K result not only in significant 
information values along a randomly chosen vector, but also in large magnitudes of the 
derivative V/, which is no longer dominated by noise, contrary to the case of uncorre- 
lated stimuli. In the end, we find that randomly choosing one of the presented frames as 
a starting guess is sufficient. 



4 Results 

We tested the scheme of looking for the most informative dimensions on model neurons 
that respond to stimuli derived from natural scenes and sounds. As visual stimuli we 
used scans across natural scenes, which were taken as black and white photos digitized 
to 8 bits with no corrections made for the camera's light intensity transformation func- 
tion. Some statistical properties of the stimulus set are shown in Fig. |3l Qualitatively, 
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Figure 3: Statistical properties of the visual stimulus ensemble. Panel (a) shows one of 
the photos. Stimuli would be 30x30 patches taken from the overall photograph. In panel 
(b) we show the power spectrum, in units of light intensity variance cr^(/), averaged over 
orientation as a function of dimensionless wave vector ka, where a is the pixel size, (c) 
The probability distribution of light intensity in units of cr{I). (d) The probability distribu- 
tion of projections between stimuli and a Gabor filter, also in units of the corresponding 
standard deviation cr(si). 



they reproduce the known results on the statisti cs of natural scenes ("Rude rman & BialeH 
19941: lRudermanl.ll994j: bong & AticnE995.:.SimonceUi & Olshauseni..2001^ . Most impor- 
tant properties for this study are strong spatial correlations, as evident from the power 
spectrum S(k) plotted in panel (b), and deviations of the probability distribution from a 
Gaussian one. The non-Gaussian character can be seen in panel (c), where the 
probability distribution of intensities is shown, and in panel (d) which shows the distri- 
bution of projections on a Gabor filter [in what follows the units of projections, such as si, 
will be given in units of the corresponding standard deviations]. Our goal is to demon- 
strate that even though the correlations present in the ensemble are non-Gaussian, they 
can be removed successfully from the estimate of vectors defining the RS. 



4.1 A model simple cell 

Our first example is based on the properties of simple cells found in the primary visual 
cortex. A model phase and orientation sensitive cell has a single relevant dimension ei 
shown in Fig.|4|^a). A given stimulus s leads to a spike if the projection si = s ■ ei reaches 
a threshold value 9 in the presence of noise: 



P(spike|s) 
P(spike) 



f(s,) = {His,-e + 0), (14) 



where a Gaussian random variable ^ of variance cr^ models additive noise, and the func- 
tion H(x) = 1 for X > 0, and zero otherwise. Together with the RF ei, the parameters 9 for 
threshold and the noise variance cr^ determine the input-output function. 
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When the spike-triggered average (STA), or reverse correlation function, is computed 
from the responses to correlated stimuli, the resulting vector will be broadened due to 
spatial correlations present in the stimuli (see Fig.|^). For stimuli that are drawn from a 
Gaussian probability distribution, the effects of correlations could be removed by multi- 
plying Vsta by the inverse of the a priori covariance matrix, according to the reverse cor- 
relation method, vcaussianest oc C~^igj.{Vsia.r Eq. (l20t . However this procedure tends to 
amplify noise. To separate errors due to neural noise from those due to the non-Gaussian 
character of correlations, note that in a model the effect of neural noise on our estimate 
of the STA can be eliminated by averaging the presented stimuli weighted with the exact 
firing rate, as opposed to using a histogram of responses to estimate P (spike |s) from a 
finite set of trials. We have used this "exact" STA, 

Vsta = j rfssP(s|spike) = ^^^^.^^^ j dsP(s) sP(spike|s), (15) 

in calculations presented in Figs. HI (b) and (c). Even with this noiseless STA (the equiva- 
lent of collecting an infinite data set), the standard decorrelation procedure is not valid for 
non-Gaussian stimuli and nonlinear input-output functions, as discussed in detail in Ap- 
pendix|Al The result of such a decorrelation in our example is shown in Fig.|4|Jc). It clearly 
is missing some of the structure in the model filter, with projection ei ■ vcaussianest ~ 0.14. 
The discrepancy is not due to neural noise or finite sampling, since the "exact" STA was 
decorrelated; the absence of noise in the exact STA also means that there would be no 
justification for smoothing the results of the decorrelation. The discrepancy between the 
true receptive field and the decorrelated STA increases with the strength of nonlinearity 
in the input-output function. 

In contrast, it is possible to obtain a good estimate of the relevant dimension ci by max- 
imizing information directly, see panel (d). A typical progress of the simulated annealing 
algorithm with decreasing temperature T is shown in Fig. IHje). There we plot both the 
information along the vector, and its projection on ei. We note that while information / 
remains almost constant, the value of projection continues to improve. Qualitatively this 
is because the probability distributions depend exponentially on information. The final 
value of projection depends on the size of the data set, as discussed below. In the example 
shown in Fig. Hlthere were ~ 50, 000 spikes with average probability of spike ~ 0.05 per 
frame, and the reconstructed vector has a projection v.max ■ ei = 0.920 ± 0.006. Having 
estimated the RF, one can proceed to sample the nonlinear input-output function. This is 
done by constructing histograms for P(s ■ ^max) and P(s ■ {}jnax|spike) of projections onto 
vector {}max found by maximizing information, and taking their ratio, as in Eq. (O. In 
Fig.Stf) we compare P (spike |s ■ t)max) (crosses) with the probability P(spike|si) used in the 
model (solid line). 

4.2 Estimated deviation from the optimal dimension 

When information is calculated from a finite data set, the (normalized) vector v which 
maximizes / will deviate from the true RF ei. The deviation 5^ = v — ei arises because 
the probability distributions are estimated from experimental histograms and differ from 
the distributions found in the limit of infinite data size. For a simple cell, the quality of 
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Figure 4: Analysis of a model simple cell with RF shown in (a). The "exact" spike- 
triggered average Vgta is shown in (b). Panel (c) shows an attempt to remove correlations 
according to reverse correlation method, C^lriori^sta', (d) the normalized vector i\aa.x found 
by maximizing information; (e) convergence of the algorithm according to information 
/(v) and projection v ■ ei between normalized vectors as a function of inverse effective 
temperature T~^. (f) The probability of a spike P(spike|s • Vmax) (crosses) is compared to 
P(spike|si) used in generating spikes (solid line). Parameters of the model are a = 0.31 
and 6 = 1.84, both given in units of standard deviation of si, which is also the units for 
a;-axis in panel (f). 

reconstruction can be characterized by the projection v ■ ii = 1 — ^Sv^, where both v and 
ei are normalized, and 5v is by definition orthogonal to ei. The deviation 5v ~ A^^VI, 
where A is the Hessian of information. Its structure is similar to that of a covariance 
matrix: 



1 



J dxP{x\spike) ( ^ 1^ 



d P(a; I spike) 



P(x) 



{{SiSj\x) - {si\x){sj\x)). 



(16) 



When averaged over possible outcomes of N trials, the gradient of information is zero 
for the optimal direction. Here in order to evaluate (5v^) = Tr[A^^(V/V/^)/l~^], we need 
to know the variance of the gradient of /. Assuming that the probability of generating 
a spike is independent for different bins, we can estimate (V/,V/j) ~ Ai/(^spikeln2). 
Therefore an expected error in the reconstruction of the optimal filter is inversely pro- 
portional to the number of spikes. The corresponding expected value of the projection 
between the reconstructed vector and the relevant direction ei is given by: 



V ■ ei 



Tt'[A- 



2iV,pikeln2' 



(17) 
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Figure 5: Projection of vector Vmax that maximizes information on RF ii is plotted as 
a function of the number of spikes. The solid line is a quadratic fit in l/A^gpike, and 
the dashed line is the leading linear term in 1/A^spike- This set of simulations was car- 
ried out for a model visual neuron with one relevant dimension from Fig. |4{a) and the 
input/output function (IT4b with parameter values cr ^ 0.61a(si), 6 ^ 0.61cr(si). For 
this model neuron, the linear approximation for the expected error is applicable for 

iVspike > 30, 000. 

where Tr' means that the trace is taken in the subspace orthogonal to the model filter^. 
The estimate dTTt can be calculated without knowledge of the underlying model, it is 
~ D / (2Nspi]^(,). This behavior should also hold in cases where the stimulus dimensions 
are expanded to include time. The errors are expected to increase in proportion to the 
increased dimensionality. In the case of a complex cell with two relevant dimensions, the 
error is expected to be twice that for a cell with single relevant dimension, also discussed 
in section 1431 

We emphasize that the error estimate according to Eq. ((TTjl is of the same order as er- 
rors of the reverse correlation method when it is applied for Gaussian ensembles. The 
latter are given by (Tr[C^^] — Cfi^)/[2A^spike(/'^(si))]. Of course, if the reverse correlation 
method were to be applied to the non-Gaussian ensemble, the errors would be larger. 
In Fig. 13 we show the result of simulations for various numbers of trials, and therefore 
^spikc- The average projection of the normalized reconstructed vector v on the RF ei be- 
haves initially as 1/A^spikc (dashed line). For smaller data sets, in this case Ai'spikcs ^ 30, 000, 
corrections ~ N'^^y-^^ become important for estimating the expected errors of the algo- 
rithm. Happily these corrections have a sign such that smaller data sets are more effective 
than one might have expected from the asymptotic calculation. This can be verified from 
the expansion v ■ ii = [1 — 5v^] ^ 1 — |(5v^) + |(5v^), where only the first two terms 

^By definition 6vi = 5v ■ ei = 0, and therefore {Svf) oc Aii is to be subtracted from 
(5v^) oc Tr[y4^^]. Because ei is an eigenvector of A with zero eigenvalue, Aii is infinite. 
Therefore the proper treatment is to take the trace in the subspace orthogonal to ei. 
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model reconstruction 
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Figure 6: Analysis of a model complex cell with relevant dimensions ei and 62 shown in 
(a) and (b). Spikes are generated according to an "OR" input-output function /(si,S2) 
with the threshold 9 ^ 0.61cr(si) and noise standard deviation a = 0.31(t(si). In panels (c) 
and (d), we show vectors vi and V2 found by maximizing information /(vi, V2). 



where taken into account in Eq. [TTl 



4.3 A model complex cell 

A sequence of spikes from a model cell with two relevant dimensions was simulated by 
projecting each of the stimuli on vectors that differ by n /2 in their spatial phase, taken 
to mimic properties of complex cells, as in Fig. |6l A particular frame leads to a spike 
according to a logical OR, that is if either |si| or |s2| exceeds a threshold value 9 in the 
presence of noise, where si = s ■ ei, S2 = s • 62- Similarly to ((T4)l . 

P(spike|s) 



P(spike) 



/(si,S2) = (i/(|si|-^-ei) V //(|s2|-e-6)), (18) 



where ,^1 and ^2 are independent Gaussian variables. The sampling of this input-output 
function by our particular set of natural stimuli is shown in Fig. [SJc). As is well known, 
reverse correlation fails in this case because the spike-triggered average stimulus is zero, 
although with Gaussian stimuli the spike-triggered covariance method would recover 
the relevant dimensions (|Touryan et al.. .200 2). Here we show that searching for max- 
imally informative dimensions allows us to recover the relevant subspace even under 
more natural stimulus conditions. 

We start by maximizing information with respect to one vector. Contrary to the result 
Fig. mje) for a simple cell, one optimal dimension recovers only about 60% of the total 
information per spike [Eq. Perhaps surprisingly, because of the strong correlations 
in natural scenes, even a projection onto a random vector in the D ~ 10'^ dimensional 
stimulus space has a high probability of explaining 60% of total information per spike. 
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as can be seen in Fig. |2l We therefore go on to maximize information with respect to 
two vectors. As a result of maximization, we obtain two vectors vi and V2 shown in 
Fig. |6l The information along them is /(vi, V2) ~ 0.90 which is within the range of in- 
formation values obtained along different linear combinations of the two model vectors 
-^(ci, e2)//spike = 0.89 ± 0.11. Therefore, the description of neuron's firing in terms of vec- 
tors vi and V2 is complete up to the noise level, and we do not have to look for extra 
relevant dimensions. Practically, the number of relevant dimensions can be determined 
by comparing /(vi, V2) to either /spike or /(vi, V2, V3), the later being the result of maxi- 
mization with respect to three vectors simultaneously. As mentioned in the Introduction, 
information along set a of vectors does not increase when extra dimensions are added to 
the relevant subspace. Therefore, if /(vi, V2) = /(vi, V2, V3) (again, up to the noise level), 
then this means that there are only 2 relevant dimensions. Using /gpikc for comparison 
with /(vi, V2) has the advantage of not having to look for an extra dimension, which can 
be computationally intensive. However /spike rnight be subject to larger systematic bias 
errors than /(vi, V2, V3). 

Vectors vi and V2 obtained by maximizing /(vi, V2) are not exactly orthogonal, and 
are also rotated with respect to ei and 62. However, the quality of reconstruction, as well 
as the value of information /(vi, V2), is independent of a particular choice of basis with 
the RS. The appropriate measure of similarity between the two planes is the dot product 
of their normals. In the example of Fig. |6l ri{ei,e2) ' "^(vi,v2) = 0.82 ± 0.07, where is 
a normal to the plane passing through vectors ei and 62- Maximizing information with 
respect to two dimensions requires a significantly slower cooling rate, and consequently 
longer computational times. However, the expected error in the reconstruction, 1 — n(ei ,62) ' 
5T,(vi,v2)/ scales as l/A'spike behavior, similarly to dTTt , and is roughly twice that for a simple 
cell given the same number of spikes. We make vectors vi and V2 orthogonal to each 
others upon completion of the algorithm. 

4.4 A model auditory neuron with one relevant dimension 

Because stimuli s are treated as vectors in an abstract space, the method of looking for the 
most informative dimensions can be applied equally well to auditory as well as to visual 
neurons. Here we illustrate the method by considering a model auditory neuron with one 
relevant dimension, which is shown in Fig. |7fc) and is taken to mimic the properties of 
cochlear neurons. The model neuron is probed by two ensembles of naturalistic stimuli: 
one is a recording of a native Russian speaker reading a piece of Russian prose, and the 
other one is a recording of a piece of English prose read by a native English speaker. Both 
of the ensembles are non-Gaussian and exhibit amplitude distributions with long, nearly 
exponential tails, cf . Fig. IZfa), which are qualitatively similar to those of light intensities 
in natural scenes ( Voss & Clarke. .1975: .Rudermarv .1994 ). However, the power spectrum 
is different in the two cases, as can be seen in Fig. |7|[b). The differences in the correlation 
structure in particular lead to different STAs across the two ensembles, cf. panel (d). Both 
of the STAs also deviate from the model filter shown in panel (c). 

Despite differences in the probability distributions P(s), it is possible to recover the 
relevant dimension of the model neuron by maximizing information. In panel (e) we 
show the two most informative vectors found by running the algorithm for the two en- 
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Amplitude (units of RMS) frequency (l<Hz) 




Figure 7: A model auditory neuron is probed by two natural ensembles of stimuli: a piece 
of English prose (1) and a piece of of Russian prose (2) . The size of the stimulus ensemble 
was the same in both cases, and the sampling rate was 44.1 kHz. (a) The probability distri- 
bution of the sound pressure amplitude in units of standard deviation for both ensembles 
is strongly non-Gaussian, (b) The power spectra for the two ensembles, (c) The rele- 
vant vector of the model neuron, of dimensionality D = 500. (d) The STA is broadened 
in both cases, but differs among the two cases due to differences in the power spectra 
of the two ensembles, (e) Vectors that maximize information for either of the ensembles 
overlap almost perfectly with each other, and with the model filter from (a), which is also 
replotted here from (c). (f) The probability of a spike P(spikc|s ■ v„iax) (crosses) is com- 
pared to P(spike|si) used in generating spikes (solid line). The input-output function had 
parameter values a ^ 0.9a(si) and 9 ~ 1.8a(si). 

sembles, and replot the model filter from (c) to show that the three vectors overlap almost 
perfectly. Thus different non-Gaussian correlations can be successfully removed to ob- 
tain an estimate of the relevant dimension. If the most informative vector changes with 
the stimulus ensemble, this can be interpreted as caused by adaptation to the probability 
distribution. 

5 Summary 

Features of the stimulus that are most relevant for generating the response of a neuron 
can be found by maximizing information between the sequence of responses and the 
projection of stimuli on trial vectors within the stimulus space. Calculated in this man- 
ner, information becomes a function of direction in stimulus space. Those vectors that 
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maximize the information and account for the total information per response of interest 
span the relevant subspace. The method allows multiple dimensions to be found. The 
reconstruction of the relevant subspace is done without assuming a particular form of 
the input-output function. It can be strongly nonlinear within the relevant subspace, and 
is estimated from experimental histograms for each trial direction independently. Most 
importantly, this method can be used with any stimulus ensemble, even those that are 
strongly non-Gaussian as in the case of natural signals. We have illustrated the method 
on model neurons responding to natural scenes and sounds. We expect the current im- 
plementation of the method to be most useful in cases where several most informative 
vectors (< 10, depending on their dimensionality) are to be analyzed for neurons probed 
by natural scenes. This technique could be particularly useful in describing sensory pro- 
cessing in poorly understood regions of higher level sensory cortex (such as visual areas 
V2, V4 and IT and auditory cortex beyond Al) where white noise stimulation is known 
to be less effective than naturalistic stimuli. 
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A Limitations of the reverse correlation method 

Here we examine what sort of deviations one can expect when applying the reverse cor- 
relation method to natural stimuli even in the model with just one relevant dimension. 
There are two factors that, when combined, invalidate the reverse correlation method: the 
non-Gaussian chara cter of correlations and the n onlinearity of the input/ o utput function 
jRingac h et al. , 1997). In its original formulation ( de Boer & Kuypei . 1968h , the neuron is 



probed by white noise and the relevant dimension ei is given by the STA ei oc (sr(s)). If 
the signals are not white, i.e. the covariance matrix Cij = {siSj) is not a unit matrix, then 
the STA is a broadened version of the original filter ei. This can be seen by noting that for 
any function F{s) of Gaussian variables {sj} the identity holds: 

{siF{s)) = {s,s,){ds^F{s)), d, = ds^. (19) 



When property u9t is applied to the vector components of the STA, (sjr (s)) = Cij {djr{s)). 
Since we work within the assumption that the firing rate is a (nonlinear) function of pro- 
jection onto one filter ei, r(s) = r(si), the later average is proportional to the model filter 
itself, (djr) = eij (r'(si)). Therefore, we arrive at the prescription of the reverse correlation 
method 

eu<x[C-%{s,r{s)). (20) 
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The Gaussian property is necessary in order to represent the STA as a convolution of the 
covariance matrix Cij of the stimulus ensemble and the model filter. To understand how 
the reconstructed vector obtained according to Eq. (l20b deviates from the relevant one, we 
consider weakly non-Gaussian stimuli, with the probability distribution 

P„g(s) = ^Po(s)e^^^(^), (21) 

where -Po(s) is the Gaussian probability distribution with covariance matrix C, and the 
normalization factor Z = (e"^^*-^^). The function Hi describes deviations of the probability 
distribution from Gaussian, and therefore we will set (sjifi) = and {siSjHi) = 0, since 
these averages can be accounted for in the Gaussian ensemble. In what follows we will 
keep only the first order terms in perturbation parameter e. Using the property (IT9t , we 
find the STA to be given by 

{s^r)nG = {s^s,) [{d^r) + e{rdj{Hi))] , (22) 

where averages are taken with respect to the Gaussian distribution. Similarly, the covari- 
ance matrix Cij evaluated with respect to the non-Gaussian ensemble is given by: 

Cij = ^{siSjC^^^) = (siSj) + e{siSk){sjdk{Hi)) (23) 

so that to the first order in e, {si-Sj) = Cij — eCik{sjdk{Hi)). Combining this with Eq. (Eb , 
we get 

{sir) nG = const x djiij + eCij ( (r - si (r') ) dj {Hi)). (24) 



The second term in (|24|l prevents the application of the reverse correlation method for 
non-Gaussian signals. Indeed, if we multiply the STA ((24|l with the inverse of the a pri- 
ori covariance matrix Cij according to the reverse correlation method ((20)l . we no longer 
obtain the RE ei. The deviation of the obtained answer from the true RE increases with 
e, which measures the deviation of the probability distribution from Gaussian. Since nat- 
ural stimuli are known to be strongly non-Gaussian, this makes the use of the reverse 
correlation problematic when analyzing neural responses to natural stimuli. 

The difference in applying the reverse correlation to stimuli drawn from a correlated 
Gaussian ensemble vs. a non-Gaussian one is illustrated in Eigs. IS] (b) and (c). In the 
first case, shown in (b), stimuli are drawn from a correlated Gaussian ensemble with the 
covariance matrix equal to that of natural images. In the second case, shown in (c), the 
patches of photos are taken as stimuli. The STA is broadened in both cases. Even though 
the two-point correlations are just as strong in the case of Gaussian stimuli as they are in 
the natural stimuli ensemble, Gaussian correlations can be successfully removed from the 
STA according to Eq. ((20)l to obtain the model filter. On the contrary, an attempt to use 
reverse correlation with natural stimuli results in an altered version of the model filter. We 
reiterate that for this example the apparent noise in the decorrelated vector is not due to 
neural noise or finite datasets, since the "exact" STA has been used (ITSb in all calculations 
presented in Eigs. |S1 and 
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Figure 8: The non-Gaussian character of correlations present in natural scenes invalidates 
the reverse correlation method for neurons with a nonlinear input-output function. Here, 
a model visual neuron has one relevant dimension ei and the nonlinear input/ output 
function, shown in (a). The "exact" STA is used (ITSt to separate effects of neural noise 
from alterations introduced by the method. The decorrelated "exact" STA is obtained by 
multiplying the "exact" STA by the inverse of the covariance matrix, according to Eq. (l20b . 

(b) Stimuli are taken from a correlated Gaussian noise ensemble. The effect of correlations 
in STA can be removed according to Eq. (EUb . When patches of photos are taken as stimuli 

(c) for the same model neuron as in (b), the decorrelation procedure gives an altered 
version of the model filter. The two stimulus ensembles have the same covariance matrix. 

The reverse correlation method gives the correct answer for any distribution of signals 
if the probability of generating a spike is a linear function of Sj, since then the second 
term in Eq. (|24)l is zero. In particular, a linear input-output relation could arise due to 
a neural noise whose variance is much larger than the variance of the signal itself. This 
point is illustrated in Figs. ^ (a), (b), and (c), where the reverse correlation method is 
applied to a threshold input-output function at low, moderate, and high signal-to-noise 
ratios. For small signal-to-noise ratios where the noise standard deviation is similar to 
that of projections si, the threshold nonlinearity in the input-output function is masked 
by noise, and is effectively linear. In this limit, the reverse correlation can be applied with 
the exact STA. However, for experimentally calculated STA at low signal-to-noise ratios 
the decorrelation procedure results in strong noise amplification. On the other hand, at 
higher signal-to-noise ratios decorrelation fails due to the nonlinearity of the input-output 
function in accordance with (|24|l . 
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Figure 9: Application of the reverse correlation method to a model visual neuron with 
one relevant dimension ci and a threshold input/ output function of decreasing values 
of noise variance a/a{si)s 6.1, 0.61, 0.06 in (a), (b), and (c) respectively. The model 
P(spike|si) becomes effectively linear when signal-to noise ratio is small. The reverse 
correlation can be used together with natural stimuli, if the input-output function is linear. 
Otherwise, the deviations between the decorrelated STA and the model filter increase 
with nonlinearity of P(spike|si). 



B Maxima of I{v): what do they mean? 

The relevant subspace of dimensionality K can be found by maximizing information si- 
multaneously with respect to K vectors. The result of maximization with respect to a 
number of vectors that is less than the true dimensionality of the relevant subspace may 
produce vectors which have components in the irrelevant subspace. This happens only in 
the presence of correlations in stimuli. As an illustration, we consider the situation where 
the dimensionality of the relevant subspace K — 2, and vector ei describes the most infor- 
mative direction within the relative subspace. We show here that even though the gradient 
of information is perpendicular to both ei and 62, it may have components outside the rel- 
evant subspace. Therefore the vector v^aax that corresponds to the maximum of I{v) will 
then lie outside the relevant subspace. We recall from Eq. (10) that 

V/(eO = / d.iP(sO;^^^%|^((s|si, spike) - {s\s,)), (25) 

J dsi P{si) 

We can rewrite the conditional averages (s|si) = / ds2P{si, S2)(s|si, S2)/P(si) arid 
(s|si, spike) = / (is2/(si, S2)P(si, S2)(s|si, S2)/P(si|spike), so that 

v7Tf-\ ^ u( ^/ I > P(spike|si,S2) - P(spike|si) d P(si|spike) 
VI{e,)^jds,ds,P{s,,s,){s\s,,s,) —\n ^^^^^ . (26) 
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Because we assume that the vector ei is the most informative within the relevant sub- 
space, eiV/ = 62 V/ = 0, so that the integral in (l26t is zero for those directions in which 
the component of the vector (s|si, S2) changes linearly with si and S2- For uncorrelated 
stimuli this is true for all directions, so that the most informative vector within the rele- 
vant subspace is also the most informative in the overall stimulus space. In the presence 
of correlations, the gradient may have non-zero components along some irrelevant direc- 
tions if projection of the vector (s|si, S2) on those directions is not a linear function of si 
and S2. By looking for a maximum of information we will therefore be driven outside 
the relevant subspace. The deviation of fmax from the relevant subspace is also propor- 
tional to the strength of the dependence on the second parameter S2, because of the factor 
[P(si, S2|spike)/P(si, S2) — P(si|spike)/P(si)] in the integrand. 



C The gradient of information 



According to the expression (|7|l, the information J(v) depends on the vector v only through 
the probability distributions Pvix) and Pv(x| spike). Therefore we can express the gradient 
of information in terms of gradients of those probability distributions: 



In 2 



dx 



, Pvfslspike) „ Pvfslspike) „ ,r^,s.s 
In V . ^ Vv(Pv(x|spike)) ^ ' ^ ^ Vv(Pv(a:)) 



Pv(x) 



(27) 



where we took into account that / c/xPv(x| spike) = 1 and does not change with v. To find 
gradients of the probability distributions, we note that 



VvPv(x) 



J dsP{s)5{x - s • v) =- J dsP{s)s5'{x - s • v) = \p{x){s\x)] ,(28) 



and analogously for Pv(x| spike): 

VvPv(a;|spike) = — — []9(x|spike)(s|x, spike)] . (29) 

Substituting expressions ((28)l and ((29)l into Eq. ((27|l and integrating once by parts we ob- 
tain: 



Vv/ = J dxPv{x) [(s|x, spike) — (s|a;)] ■ 
which is the expression (ITOb of the main text. 



d Pv(x|spike) 
dx Pv(x) 
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